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Abstract 

Support vector machines (SVMs) are an extremely successful type of classifica- 
tion and regression algorithms. Building an SVM entails solving a constrained 
convex quadratic programming problem, which is quadratic in the number of 
training samples. We introduce an efficient parallel implementation of an sup- 
port vector regression solver, based on the Gaussian Belief Propagation algo- 
rithm (GaBP). 

In this paper, we demonstrate that methods from the complex system domain 
could be utilized for performing efficient distributed computation. We com- 
pare the proposed algorithm to previously proposed distributed and single-node 
SVM solvers. Our comparison shows that the proposed algorithm is just as 
accurate as these solvers, while being significantly faster, especially for large 
datasets. We demonstrate scalability of the proposed algorithm to up to 1,024 
computing nodes and hundreds of thousands of data points using an IBM Blue 
Gene supercomputer. As far as we know, our work is the largest parallel im- 
plementation of belief propagation ever done, demonstrating the applicability 
of this algorithm for large scale distributed computing systems. 



1 Introduction 

Support-vector machines (SVMs) are a class of algorithms that have, in recent years, exhibited 
superior performance compared to other pattern classification algorithms. There are several 
formulations of the SVM problem, depending on the specific application of the SVM (e.g., 
classification, regression, etc.). 

One of the difficulties in using SVMs is that building an SVM requires solving a constrained 
quadratic programming problem, whose size is quadratic in the number of training examples. 
This fact has led to extensive research on efficient SVM solvers. Recently, several researchers 
have suggested using multiple computing nodes in order to increase the computational power 
available for solving SVMs. 
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In this article, we introduce a distributed SVM solver based on the Gaussian Belief Propagation 
(GaBP) algorithm. We improve on the original GaBP algorithm by reducing the communication 
load, as represented by the number of messages sent in each optimization iteration, from 0{n?) 
to 0{n) aggregated messages, where n is the number of data points. Previously, it was known 
that the GaBP algorithm is very efficient for sparse matrices. Using our novel construction, we 
demonstrate that the algorithm exhibits very good performance for dense matrices as well. We 
also show that the GaBP algorithm can be used with kernels, thus making the algorithm more 
powerful than what was considered previously thought possible. 

Using extensive simulation we demonstrate the applicability of our protocol vs. the state-of- 
the-art existing parallel SVM solvers. Using a Linux cluster of up to a hundred machines and 
the IBM Blue Gene supercomputer we managed to solve very large data sets up to hundreds of 
thousands data point, using up to 1,024 CPUs working in parallel. Our comparison shows that 
the proposed algorithm is just as accurate as these previous solvers, while being significantly 
faster. 

A preliminary version of this paper appeared as a poster in [20]. 

1.1 Classification Using Support Vector Machines 

We begin by formulating the SVM problem. Consider a training set: 

D = {{-K,,yi), i = l,...,N, X, £3?", e{-l,l}}. (1) 

The goal of the SVM is to learn a mapping from to yi such that the error in mapping, as 
measured on a new dataset, would be minimal. SVMs learn to find the linear weight vector that 
separates the two classes so that 

(xi-w + &) > 1 /or i = l,...,iV. (2) 

There may exist many hyperplanes that achieve such separation, but SVMs find a weight vector 
w and a bias term h that maximize the margin 2/ ||w||. Therefore, the optimization problem 
that needs to be solved is ^ 

min Jz5(w) = - ||w|| , (3) 
Subject to (xi ■ w + fe) > 1 for i 1, . . . , N. (4) 

Any points lying on the hyperplane (xi ■ w + 6) = 1 are called support vectors. 

If the data cannot be separated using a linear separator, a slack variable ^ > is introduced and 
the constraint is relaxed to: 

2/,(xi-w + 6)>l-^, /or z = l,...,iV. (5) 
The optimization problem then becomes: 

minJ,5(w) = -||wI|+C^6, (6) 

1=1 

subject to yi (xi • w + fe) > 1 for i ~ 1, . . . ,N, (7) 

6>0 for 1 = 1,..., N. (8) 

The weights of the linear function can be found directly or by converting the problem into its 
dual optimization problem, which is usually easier to solve. 

Using the notation of Vijayakumar and Wu [7j, the dual problem is thus: 

max Loih) =^h^-^h' ■ D ■ h, (9) 

i 

subject to 0<h,<C, i = 1, N, (10) 
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^ihiUi = 0. 



(11) 



where D '\s a matrix such that Dij = yiVjK (x^, x^) and K (•, •) is either an inner product of the 
samples or a function of these samples. In the latter case, this function is known as the kernel 
function, which can be any function that complies with the Mercer conditions [9J. For example, 
these may be polynomial functions, radial-basis (Gaussian) functions, or hyperbolic tangents. 
If the data is not separable, C is a tradeoff between maximizing the margin and reducing the 
number of misclassifications. 

The classification of a new data point is then computed using the following equation: 

{x)^signi ^ hitjiK {xi, x) + b \ (12) 

\i£SV J 

1.2 Kernel Ridge Regression problem 

Kernel Ridge Regression (KRR) implements a regularized form of the least squares method useful 
for both regression and classification. The non-linear version of KRR is similar to the Support- 
Vector Machine (SVM) problem. However, in the latter, special emphasis is given to points 
close to the decision boundary, which is not provided by the cost function used by KRR. 

Given training data 

V = {xi,yi}[^j^, X, eR'^ ,yieR 

the KRR algorithm determines the parameter vector w e i?'' of a non-linear model (using the 
"kernel trick"), via minimization of the following objective function: [5j: 

minA||w|p+^(y,-w^$(x,))' 

1=1 

where A is a tradeoff parameter between the two terms of the optimization function, and $() is 
a (possible non-linear) mapping of the training patterns. 

One can show that the dual form of this optimization problem is given by: 

maxTy(a) =y^a + l/4Aa^Ka-l/4a^a (13) 

where K is a matrix whose («, j)-th entry is the kernel function j — $(xi)-^<I>(xj). 

The optimal solution to this optimization problem is: 

a = 2A(K + AI)-V 

The corresponding prediction function is given by: 

/(x) =w^$(x) -y^(K + AI)-iK(x„x). 

The underlying assumption used is that the kernel matrices are invertible. 

1.3 Previous Approaches for Solving Parallel SVMs 

There are several main methods for finding a solution to an SVM problem on a single-node 
computer. (See Chapter 10 of [9J) for a taxonomy of such methods.) However, since solving 
an SVM is quadratic in time and cubic in memory, these methods encounter difFiculty when 
scaling to datasets that have many examples and support vectors. The latter two are not 
synonymous. A large dataset with many repeated examples might be solved using sub-sampling 
approaches, while a highly non-separable dataset with many support vectors will require an 
altogether different solution strategy. The literature covers several attempts at solving SVMs 
in parallel, which allow for greater computational power and larger memory size. In Collobert 
et al. [15j the SVM solver is parallelized by training multiple SVMs, each on a subset of the 
training data, and aggregating the resulting classifiers into a single classifier. The training data 
is then redistributed to the classifiers according their performance and the process is iterated 
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until convergence is reached. The need to re-divide the data among the SVM classifiers means 
that the data must be moved between nodes several times; this rules out the use of an approach 
where bandwidth is a concern. A more low-level approach is taken by Zanghirati et al. [14], 
where the quadratic optimization problem is divided into smaller quadratic programs (similar to 
the Active Set methods), each of which is solved on a different node. The results are aggregated 
and the process is repeated until convergence. The performance of this method has a strong 
dependence on the caching architecture of the cluster. Graf et al. [17] partition the data and 
solve an SVM for each partition. The support vectors from each pair of classifiers are then 
aggregated into a new training set for which an SVM is solved. The process continues until a 
single classifier remains. The aggregation process can be iterated, using the support vectors of 
the final classifier in the previous iteration to seed the new classifiers. One problem with this 
approach is that the data must be repeatedly shared between nodes, meaning that once again 
the goal of data distribution cannot be attained. The second problem, which might be more 
severe, is that the number of possible support vectors is restricted by the capacity of a single 
SVM solver. Yom Tov j8] proposed modifying the sequential algorithm developed in [7] to batch 
mode. In this way, the complete kernel matrix is held in distributed memory and the Lagrange 
multipliers are computed iteratively. This method has the advantage that it can efficiently solve 
difficult SVM problems that have many support vectors to their solution. Based on that work, 
we show in this paper how an SVM solution can be obtained by adapting a Gaussian Belief 
Propagation algorithm to the solution of the algorithm proposed in [J- 

Recently, Hazan et al. proposed an iterative algorithm for parallel decomposition based on Fenchel 
Duality [23J. Zanni et a/, proposes a decomposition method for computing SVM in parallel [22J. 
We compare our run time results to both systems in Section [5] 

2 Gaussian Belief Propagation 

In this section we present our novel contribution - a Gaussian Belief Propagation solver for 
distributed computation of the SVM problem [I9l ITHl |2T| . 

Following, we provide a step-by-step derivation of our GaBP solver from our proposed cost 
function. As stated in the previous section, our aim is to find x*, a solution to the quadratic 
cost function given in eq. ([9]). Using linear algebra notation, we can rewrite the same cost 
function: niin_E(x) = x^Wx/2 — y-^x. 

As the matrix W is symmetric Q {e.g. , W = S^S, the derivative of the quadratic form with 
respect to the vector x is given by E'{x.) = Wx — y. 

Thus, equating -E'(x) — gives the global minimum x* of this convex function, which is the 
desired solution x* = W^^y. 

Now, one can define the following yo/nt/y Gaussian distribution 

p(x) = Z-^ exp ( - E{x.)) = Z-^ exp (-x'^Wx/2 + y^x), (14) 

where Z is a distribution normalization factor. Defining the vector fj, = W^^y, one gets the 
form 

p{x) = Z-iexp(^^W^/2) 

X exp(-x'^Wx/2 + ^'^Wx-/i^W^/2) 

= r^cxp(-i(x-M)^W(x-/i)) 

= AA(At,W-i), (15) 

where the new normalization factor C, = Z exp To summarize to this point, the 

target solution x* — W~^y is equal to /i = W^^y, which is the mean vector of the distribution 
p(x) , as defined in eq. ()14p . 

^an extension to non-symmetric matrices is discussed in ^18j . For simplicity of arguments, we handle 
the symmetric case in this paper 
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The formulation above allows us to shift the rating problem from an algebraic to a probabilistic 
domain. Instead of solving a deterministic vector-matrix linear equation, we now solve an infer- 
ence problem in a graphical model describing a certain Gaussian distribution function. Given the 
Peer-to-Peer graph W and the prior vector y, one knows how to write explicitly p(x) (|14p and 
the corresponding graph Q with edge potentials (compatibility functions) jpij and self-potentials 
('evidence') 0^. These graph potentials are determined according to the following pairwise fac- 
torization of the Gaussian distribution p(x) (fT4|l 

K 

p{x) (xY\_(t'i{xt) Y\_ Ajixt^Xj), (16) 
»=i {hj} 

resulting in ■ipij{xi,Xj) = exp{—XiAijXj) and 4>i{xi) = exp (biXi — Aaxj/i). The set of edges 
{i, i} corresponds to the set of network edges W. Hence, we would like to calculate the marginal 
densities, which must also be Gaussian, 

p{x,)^N{ti^ = {W-^y},,p-^ = {W-i},,), 

where /i^ and Pi are the marginal mean and inverse variance (a.k.a. precision), respectively. 
Recall that, according to our previous argumentation, the inferred mean /i; is identical to the 
desired solution x*. 

The move to the probabilistic domain calls for the utilization of BP as an efficient inference 
engine. The sum-product rule of BP for continuous variables, required in our case, is given 
by [1] 




where mij{xj) is the message sent from node i to node j over their shared edge on the graph, 
scalar a is a normalization constant and the set denotes all the nodes neighboring Xi, 

except Xj. The marginals are computed according to the product rule [I] 

p{xi) ^ a(f>i{xi) J]^ nikiixi). (18) 

GaBP is a special case of continuous BP where the underlying distribution is Gaussian. In [21J 
we show how to derive the GaBP update rules by substituting Gaussian distributions in the 
continuous BP equations. The output of this derivation is update rules that are computed 
locally by each node. The GaBP-based implementation of the Peer-to-Peer rating algorithm is 
summarized in Table 1. 

Algorithm 1 can be easily executed distributively. Each node i receives as an input the i'th row 
(or column) of the matrix W and the scalar bi. In each iteration, a message containing two 
reals, /iy and P^ , is sent to every neighboring node through their mutual edge, corresponding 
to a non-zero Aij entry. 

Convergence. If it converges, GaBP is known to result in exact inference [1]. In contrast 
to conventional iterative methods for the solution of systems of linear equations, for GaBP, 
determining the exact region of convergence and convergence rate remain open research prob- 
lems. All that is known is a sufficient (but not necessary) condition [4 5j stating that GaBP 
converges when the spectral radius satisfies pdl^f — W|) < 1. A stricter sufficient condi- 
tion [1], actually proved earlier, determines that the matrix W must be diagonally dominant 
{i.e. , \Wu\ > J2j^i '"^ order for GaBP to converge. 

Efficiency. The local computation at a node at each round is fairly minimal. Each node i 
computes locally two scalar values /iy, Pij for each neighbor j G N{i). Convergence time 
is dependent on both the inputs and the network topology. Empirical results are provided in 
Section [5] 
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Algorithm 1 



1 . 


Initialize : 


/ 


Set the neighborhood N(i) to include 








Vfc ^ i3AM ^ 0. 






/ 

V 


Sst ths sca.la.r fixes 








Pa = Ajj and Lin — hi 1 Aji , 






/ 


Set the initial N(z) i5 /c — > % scalar messages 








Pi'i = and iii-^ — 






/ 


Set a convergence threshold 6. 


2 . 


Tt e Vrj t P * 


/ 


Propagate the N(z) 3 — > i messages 








Pki and Hki, Vi (under certain scheduling) . 






/ 


Compute the N(j) 9 i ^ j scalar messages 








Pij — —Afj/ [Pii + X]/ceN(i)\j ^fci)' 








— i P^^tiAA A- N ^ , , Pt.^iii.^] A^A 

h^ij \^ ith^ii Z^fcgN(i)\j kiH'Ki J / -^ij ■ 


3. 


Check: 


/ 


If the messages Py and /iij did not 








converge (w.r.t. f^)^ return to Step 2. 






/ 


Else, continue to Step 4. 


4 . 


In fer: 


/ 


Compute the marginal means 








Aii = + Efc6N(») Pkzf^kt) / {Pvi + EfceN(«) • 






(/ 


Optionally compute the marginal precisions 








Pi = -Pit + X]feeN('i) ) 


5. 


Solve : 


/ 


Find the solution 








X* = fj-i, Vi. 



3 Proposed Solution of SVM Solver Based on GaBP 

For our proposed solution, we take the exponent of dual SVM formulation given in equation ([9]) 
and solve maxcxp(L/)(/i)). Since cxp{LD{h)) is convex, the solution of maxexp(L/)(/i)) is 
a global maximum that also satisfies maxL£)(ft,) since the matrix D is symmetric and positive 
definite. Now we can relate to the new problem formulation as a probability density function, 
which is in itself Gaussian: 

p{h) (X exp{-^h'Dh + h'l), 

where 1 is a vector of (1, 1, • ■ • , 1) and find the assignment of ft, = argmaxp(/i). It is known [5] 
that in Gaussian models finding the MAP assignment is equivalent to solving the inference 
problem. To solve the inference problem, namely computing the marginals h, we propose using 
the GaBP algorithm, which is a distributed message passing algorithm. We take the computed 
h as the Lagrange multiplier weights of the support vectors of the original SVM data points and 
apply a threshold for choosing data points with non-zero weight as support vectors. 

Note that using this formulation we ignore the remaining constraints (TU] [11] In other words we 
do not solve the SVM problem, but the kernel ridge regression problem. Nevertheless, empirical 
results presented in Section [5] show that we achieve very good classification vs. state-of-the-art 
SVM solvers. 

Finally, following [7], we remove the explicit bias term b and instead add another dimension to the 
pattern vector such that Xi = {xi,X2, ■ ■ ■ , xn, A), where A is a scalar constant. The modified 
weight vector, which incorporates the bias term, is written as w = {wi,W2, ■ ■ ■ ,wn ,b/ X). 
However, this modification causes a change to the optimized margin. Vijayakumar and Wu [7] 
discuss the effect of this modification and reach the conclusion that "setting the augmenting 
term to zero (equivalent to neglecting the bias term) in high dimensional kernels gives satisfactory 
results on real world data" . We did not completely neglect the bias term and in our experiments, 
which used the Radial Basis Kernel, set it to 1/N, as proposed in [5]. 
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3.1 GaBP Algorithm Convergence 



In order to force the algorithm to converge, we artificially weight the main diagonal of the kernel 
matrix D to make it diagonally dominant. Section [5] outlines our empirical results showing that 
this modification did not significantly affect the error in classifications on all tested data sets. 

A partial justification for weighting the main diagonal is found in [6J. In the 2-Norm soft margin 
formulation of the SVM problem, the sum of squared slack variables is minimized: 



where 5ij is the Kronecker 5 defined to be 1 when i = j, and zero elsewhere. It is shown that the 
only change relative to the 1-Norm soft margin SVM is the addition of 1/C to the diagonal of 
the inner product matrix associated with the training set. This has the effect of adding 1/C to 
the eigenvalues, rendering the kernel matrix (and thus the GaBP problem) better conditioned [6J. 

3.2 Convergence in Asynchronous Settings 

One of the desired properties of a large scale algorithm is that it should converge in asynchronous 
settings as well as in synchronous settings. This is because in a large-scale communication 
network, clocks are not synchronized accurately and some nodes may be slower than others, 
while some nodes experience longer communication delays. 

Recent work by Koller et. al [3] defines conditions for the convergence of belief propagation. 
This work defines a distance metric on the space of BP messages; if this metric forms a max- 
norm construction, the BP algorithm converges under some assumptions. Using experiments 
on various network sizes, of up to a sparse matrix of one million over one million nodes, the 
algorithm converged asynchronously in all cases where it converged in synchronous settings. 
Furthermore, as noted in [3], in asynchronous settings the algorithm converges faster as compared 
to synchronous settings. 

4 Algorithm Optimization 

Instead of sending a message composed of the pair of and Pij, a node broadcasts aggregated 
sums, and consequently each node can retrieve locally the Pi\j and from the sums by means 
of a subtraction: 

Instead of sending a message composed of the pair of and Pij, a node can broadcast the 
aggregated sums 



min ||w|j2 + CJ:,^f 



s.t. y.j(w • Xi + 6) > 1 - 



The dual problem is derived: 



W{h) 




(19) 



fceN(i) 




(20) 



A:6N(i) 



Consequently, each node can retrieve locally the Pi\j and from the sums by means of a 

subtraction 




The rest of the algorithm remains the same. 
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Algorithm 2 



1 . 


Initial! ze : 


•J 


Set the neighborhood N(i) to include 








Vfc id/lfei ^ U. 






/ 


Set the scalar fixes 








Pii = An and = foj/Aij, Vi. 






/ 


Set the initial i ^ N(i) broadcast messages 








-Pi = and [Li — 0. 






/ 


Set the initial N(i) 3 fc ^ i internal scalars 








Pki = and ^fci = 0. 








Set a convergence threshold e. 


2 . 


Iterate : 


/ 


Broadcast the aggregated sum messages 








Pj = -Pfii + X]fceN(i) 








— 1 

Ui = -Pi (PiiUii + \ PkiUki), Vz 








(under certain scheduling) . 






/ 


Compute the N(j) 3 i ^ j internal scalars 








p,,^~Ai/{h-p,i), 








tl^A — iP^ll^ — PaaII^a\IA^^ 


3. 


Check: 


/ 


If the internal scalars Pij and /i^ did not 








converge (w.r.t. e) , return to Step 2. 






/ 


Else, continue to Step 4. 


4 . 


Infer : 


/ 


Compute the marginal means 








[J-i = {Piil^ii + X]fegN(i) Pki/J-ki) / {Pii + X]fcGN(i) -^ki) — jj-i, Vi. 






Optionally compute the marginal precisions 








Pi = Pii + X]A:eN(i) ^ki ^ Pi ) 


5. 


Solve : 


/ 


Find the solution 








X* = fii, Vi. 



5 Experimental Results 

We implemented our proposed algorithm using approximately 1,000 lines of code in C. We 
implemented communication between the nodes using the MPICH2 message passing interface 
[?4j . Each node was responsible for d data points out of the total n data points in the dataset. 

Our implementation used synchronous communication rounds because of MPI limitations. In 
Section [6] we further elaborate on this issue. 

Each node was assigned several examples from the input file. Then, the kernel matrix D was 
computed by the nodes in a distributed fashion, so that each node computed the rows of the 
kernel matrix related to its assigned data points. After computing the relevant parts of the matrix 
D, the nodes weighted the diagonal of the matrix D, as discussed in Section [5~T1 Then, several 
rounds of communication between the nodes were run. In each round, using our optimization, a 
total of 71 sums were calculated using MPLAIIreduce system call. Finally, each node output the 
solution X, which was the mean of the input Gaussian that matched its own data points. Each 
Xi signified the weight of the data point i for being chosen as a support vector. 

To compare our algorithm performance, we used two algorithms: Sequential SVM (SVMSeq) 
[3 and SVMIight [16]. We used the SVMSeq implementation provided within the IBM Parallel 
Machine Learning (PML) toolbox [lOj. The PML implements the same algorithm by Vijaykumar 
and Wu [7j that our GaBP solver is based on, but the implementation in through a master-slave 
architecture as described in |8j. SVMIight is a single computing node solver. 

Table [1] describes the seven datasets we used to compare the algorithms and the classification 
accuracy obtained. These computations were done using five processing nodes (3.5GHz Intel 
Pentium machines, running the Linux operating system) for each of the parallel solvers. All 
datasets were taken from the UCI repository [llj. We used medium-sized datasets (up to 20,000 
examples) so that run-times using SVMIight would not be prohibitively high. All algorithms were 
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n 1 mpncicin 

LJ II 1 Id l3lwl 1 


1 r3 1 n 


Test 


Error (%) 


GaBP 


Sequential 


SVMIight 


Isolet 


617 


6238 


1559 


7.06 


5.84 


49.97 


Letter 


16 


20000 




2.06 


2.06 


2.3 


Mushroom 


117 


8124 




0.04 


0.05 


0.02 


Nursery 


25 


12960 




4.16 


5.29 


0.02 


Pageblocks 


10 


5473 




3.86 


4.08 


2.74 


Pen digits 


16 


7494 


3498 


1.66 


1.37 


1.57 


Spambase 


57 


4601 




16.3 


16.5 


6.57 



Table 1: Error rates of the GaBP solver versus those of the parallel sequential solver and SVMIight 



Dataset 


Run times (sec) 




GaBP 


Sequential 


Isolet 


228 


1328 


Letter 


468 


601 


Mushroom 


226 


176 


Nursery 


221 


297 


Pageblocks 


26 


37 


Pen digits 


45 


155 


Spambase 


49 


79 



Table 2: Running times (in seconds) of the GaBP solver (working in a distributed environment) 
compared to that of the IBM parallel solver 



run with an RBF kernel. The parameters of the algorithm (kernel width and misclassification 
cost) were optimized using a line-search algorithm, as detailed in [12]. 

Note that SVMIight is a single node solver, which we use mainly as a comparison for the accuracy 
in classification. 

Using the Friedman test [13J, we did not detect any statistically significant difference between 
the performance of the algorithms with regards to accuracy {p < ClO^-^). 

Figure[T]shows the speedup results of the algorithm when running the GaBP algorithm on a Blue 
Gene supercomputer. The speedup with nodes is computed as the run time of the algorithm 
on a single node, divided by the run time using N nodes. Obviously, it is desirable to obtain 
linear speedup, i.e., doubling computational power halves the processing time, but this is limited 
by the communication load and by parts of the algorithm that cannot be parallelized. Since 
Blue Gene is currently limited to 0.5 GB of memory at each node, most datasets could not be 
run on a single node. We therefore show speedup compared to two nodes. As the figure shows, 
in most cases we get a linear speedup up to 256 CPUs, which means that the running time is 
linearly proportional to one over the number of used CPUs. When using 512 - 1024 CPUs, the 
communication overhead reduces the efficiency of the parallel computation. We identified this 
problem as an area for future research into optimizing the performance for larger scale grids. 

We also tested the ability to build classifiers for larger datasets. Table [3] shows the run times 
of the GaBP algorithm using 1024 CPUs on two larger datasets, both from the UCI repository. 
This demonstrates the ability of the algorithm to process very large datasets in a reasonably 
short amount of time. We compare our running time to state-of-the-art parallel decomposition 
method by Zanni et al. [22\ and Hazan et al. . Using the MNIST dataset we where considerably 
slower by a factor of two, but in the larger Covertype dataset we have a superior performance. 
Note that it is hard to compare running times since the machines used for experimentation are 
different. Zanni used 16 Pentium IV machines with 16Gb memory, Hazan used 10 Pentium IV 
machines with 4Gb memory while we used a larger number of weaker Pentium IV machines with 
400Mb of memory. Furthermore, in the Covertype dataset we used only 150,000 data points 
while Zanni and Hazan used the full dataset which is twice larger. 
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SVM using GaBP Speed-up 




10 10 

Number of CPUs 



Figure 1: Speedup of the GaBP algorithm vs. 2 CPUS 



Dataset 


Dim 


Num of examples 


Run time GaBP (sec) 


Run time [22] (sec) 


Run time [23] 


Covertype 


54 


150,000/300,000 


468 


24365 


16742 


MNIST 


784 


60,000 


756 


359 


18 



Table 3: Running times of the GaBP solver for large data sets using 1024 CPUs on an IBM Blue 
Gene supercomputer. Running time results are compared to two state-of-the-art solvers: [22| 
and L23j. 



6 Discussion 

In this paper we demonstrated the application of the Gaussian Belief Propagation to the solu- 
tion of SVM problems. Our experiments demonstrate the usefulness of this solver, being both 
accurate and scalable. 

We implemented our algorithm using a synchronous communication model mainly because 
MPICH2 does not support asynchronous communication. While synchronous communication 
is the mode of choice for supercomputers such as Blue Gene, in many cases such as heteroge- 
neous grid environments, asynchronous communication will be preferred. We believe that the 
next challenging goal will be to implement the proposed algorithm in asynchronous settings, 
where algorithm rounds will no longer be synchronized. 

Our initial experiments with very large sparse kernel matrices (millions of data points) show that 
asynchronous settings converge faster. Recent work by Koller [3J supports this claim by showing 
that in many cases the BP algorithm converges faster in asynchronous settings. 

Another challenging task would involve scaling to data sets of millions of data points. Currently 
the full kernel matrix is computed by the nodes. While this is effective for problems with many 
support vectors [8J, it is not required in many problems which are either easily separable or else 
where the classification error is less important compared to the time required to learn the mode. 
Thus, solvers scaling to much larger datasets may have to diverge from the current strategy of 
computing the full kernel matrix and instead sparsify the kernel matrix as is commonly done in 
single node solvers. 
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Finally, it remains an open question whether SVMs can be solved efficiently in Peer-to-Peer 
environments, where each node can (efficiently) obtain data from only several close peers. Future 
work will be required in order to verify how the GaBP algorithm performs in such an environment, 
where only partial segments of the kernel matrix can be computed by each node. 
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